Beat tracking is the task of determining the time points at which a human listener would tap along while listening to a song.
Beat
A sequence of equally spaced pulses that occur periodically in music; the perceptually most salient pulse (the foot-tapping rate).
Tempo
Tempo refers to the rate of the musical beat and is given by the reciprocal of the beat period. It is usually specified in beats per minute (BPM).
Both tempo and beat are fundamental properties of music.
Note onset
When playing a note on an instrument such as a piano, there is often a sudden increase of energy at the beginning of the musical tone.
The attack of a note refers to the phase where the sound builds up, which typically goes along with a sharply increasing amplitude envelope.
A transient refers to a noise-like sound component of short duration and high amplitude, typically occurring at the beginning of a musical tone.
The onset of a note refers to the single instant (rather than a period) that marks the beginning of the transient, or the earliest time point at which the transient can be reliably detected.
Onset detection can be formulated as an acoustic event detection problem, with the signal under analysis being a piece of music and the events being the note onsets.
Onset candidates may be determined by locating time positions where the signal's amplitude envelope starts increasing.
Challenges

When there is no clear attack phase, such as in non-percussive music with soft onsets and blurred note transitions, detecting onsets is much more challenging. For example, the waveform of a violin sound may exhibit a slow energy increase rather than an abrupt change as in a piano sound. For such soft sounds, it is hard to determine, or even to define, the exact onset position.

Detecting individual note onsets becomes even harder in complex polyphonic music. Simultaneously occurring sound events may cause masking effects, so that no significant changes in the signal's energy are measurable.
from IPython.display import YouTubeVideo
YouTubeVideo('rY0WxgSXdEE', width=600, height=300)
Note Onset Detection (NOD) methods

In the literature, NOD methods fall into two groups: data-driven and non-data-driven.

Data-driven methods build statistical models by applying machine-learning algorithms to a large set of training data, for instance learning a probabilistic (hidden Markov) model that exploits the rhythmic regularity in music, or training a neural network. These machine-learning methods can either solve a classification problem, differentiating onsets from non-onsets, or a regression problem whose output is then used to estimate a suitable onset detection function (ODF).

In non-data-driven methods, the ODF is calculated directly from the analyzed signal or its extracted features. They may operate in the time or frequency domain and differ in the type of features extracted from the analyzed signal: magnitude, phase, energy, power spectral density, or time derivatives of these.

State-of-the-art non-data-driven methods use the complex flux feature.
1. Spectrogram (STFT)
To detect note onsets, the first step is to detect changes in the signal's energy or frequency content. For that, we apply the STFT to convert the signal into a spectrogram; the figure below shows the magnitude of the spectrogram.
2. Log compression
Log compression adjusts the dynamic range of the signal to enhance weaker spectral components, particularly in the high-frequency range.
3. Differentiation
Compute the discrete derivative of the log-compressed spectrum row-wise, i.e., for each frequency band separately. Only positive differences are kept, since we are interested only in energy increases.
4. Accumulation
All the positive differences obtained above are summed column-wise, which yields a single nonnegative number at each point in time. The resulting curve is known as the novelty curve.
Its peaks indicate time positions where major changes happen in the spectrum.
We then compute a local average curve and subtract it from the novelty curve.
5. Normalization
After normalizing the novelty curve, its peaks are good candidates for note onsets, as they signify positive energy changes.
6. Peak picking (thresholding)
Peaks of the novelty curve are note onset candidates; thresholding can be used to extract the note onsets. Peak picking is a very fragile step, in particular for soft onsets (e.g., strings).
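The novelty-curve pipeline above (STFT, log compression, differentiation, accumulation, local-average subtraction, normalization) can be sketched with NumPy/SciPy on a synthetic signal. The noise-burst input, window sizes, and compression factor `gamma` below are assumed values for illustration, not taken from the lecture audio:

```python
import numpy as np
from scipy.signal import stft

np.random.seed(0)

# Synthetic 1-second test signal: silence with two short noise bursts
# standing in for note onsets (an assumed example input).
sr = 22050
x = np.zeros(sr)
x[5000:5200] = np.random.randn(200)
x[15000:15200] = np.random.randn(200)

# 1. Spectrogram: STFT magnitudes (hop = 256 samples)
_, _, X = stft(x, fs=sr, nperseg=1024, noverlap=1024 - 256)
S = np.abs(X)

# 2. Log compression (gamma is an assumed compression factor)
gamma = 100.0
S_log = np.log1p(gamma * S)

# 3. Differentiation: keep only positive differences per frequency band
diff = np.diff(S_log, axis=1)
diff[diff < 0] = 0.0

# 4. Accumulation: sum over frequency -> raw novelty curve,
#    then subtract a local average and half-wave rectify
novelty = diff.sum(axis=0)
win = 11
local_avg = np.convolve(novelty, np.ones(win) / win, mode='same')
novelty = np.maximum(novelty - local_avg, 0.0)

# 5. Normalization
novelty /= novelty.max()

# Peaks now sit near the two bursts (frames ~20 and ~59 at hop 256);
# peak picking / thresholding would extract them as onset candidates.
```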
Objectives

For tempo estimation, we assume that within a given time window beats occur in a periodic fashion. We will analyze the novelty curve for these periodicities using the autocorrelation tempogram.
Grosche, Peter, Meinard Müller, and Frank Kurth. “Cyclic tempogram - A mid-level tempo representation for music signals.” ICASSP, 2010.
1. Time-shifted copies of the novelty curve
We compare the novelty curve with time-shifted copies of itself.
2. Autocorrelation
We take a window and compute the autocorrelation while sliding the window across the signal.
3. Correlation plot
The values obtained form the autocorrelation plot, which shows:
a. high values for time lags where there is a high correlation
b. regions of the signal that are self-similar
4. Convert time lag into tempo (BPM)
Tempo (in BPM) = 60 / Lag (in sec)
5. Interpolation
Interpolate onto a linear tempo axis within a musically meaningful tempo range.
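Steps 4 and 5 can be illustrated with a small NumPy sketch. The frame rate and the toy autocorrelation curve below (peaking at a 0.5 s lag, i.e., 120 BPM) are assumptions for illustration:

```python
import numpy as np

# Assumed novelty-curve frame rate and a toy autocorrelation curve
# peaking at a lag of 0.5 s (i.e., a true tempo of 120 BPM).
frame_rate = 100.0
lags = np.arange(1, 400)          # lag in frames (lag 0 is skipped)
lag_sec = lags / frame_rate       # lag in seconds
autocorr = np.exp(-0.5 * ((lag_sec - 0.5) / 0.05) ** 2)

# Step 4: Tempo (BPM) = 60 / Lag (sec); e.g., a 0.5 s lag is 120 BPM
tempo_bpm = 60.0 / lag_sec
print(60.0 / 0.5)                 # -> 120.0

# Step 5: interpolate onto a linear tempo axis in a musically
# meaningful range (here 30-300 BPM); np.interp needs increasing x
tempo_axis = np.arange(30, 301)
order = np.argsort(tempo_bpm)
acf_on_tempo_axis = np.interp(tempo_axis, tempo_bpm[order], autocorr[order])

print(tempo_axis[np.argmax(acf_on_tempo_axis)])  # -> 120
```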
6. Generating the tempogram
We apply the above principle of autocorrelation to estimate the tempo at every segment of the novelty function.
The figure below shows that a given value in the tempogram corresponds to a given section of the novelty curve.
Here we convert the lag in seconds into tempo in BPM (see step 4); the values highlighted in the figure below are the same values obtained in step 5.
Rescaling the tempogram
Fourier tempogram
The Fourier Tempogram is basically the magnitude spectrogram of the novelty function.
To obtain the Fourier tempogram, compute the short-time Fourier transform (STFT) of the novelty function. Since the novelty function is itself computed in frame increments, the hop length of this STFT should be quite small:
The frequency axis above is interpreted as tempo in BPM rather than frequency in Hz.
For practical applications, when computing the Fourier tempogram via the STFT we use only a small set of tempo parameters, e.g., [30:600], covering the (integer) musical tempi between 30 and 600 BPM.
These bounds are motivated by the assumption that only musical events with a temporal separation between roughly 100 ms (600 BPM) and 2 s (30 BPM) contribute to the perception of tempo.
A window size corresponding to 4 to 12 s of audio is a reasonable range.
The visualization of the Fourier tempogram reveals the dominant tempo over time. In the Fourier tempogram above, the dominant tempo of this excerpt lies between 200 and 300 BPM and increases slightly over time. The tempogram also shows some larger coefficients corresponding to 1/3 and 2/3 of the main tempo.
The idea of the Fourier analysis is to detect local periodicities in the novelty curve by comparing it with windowed sinusoids.
A high correlation of a local section of the novelty curve with a windowed sinusoid indicates a periodicity at the sinusoid's frequency (given a suitable phase).
This correlation (along with the phase) can be computed using a short-time Fourier transform.
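This comparison can be made concrete in a few lines: each Fourier tempogram entry is the magnitude of the correlation between a windowed novelty section and a complex sinusoid. The impulse train, frame rate, and Hann window below are assumed for illustration:

```python
import numpy as np

# A synthetic novelty curve: impulses every 0.5 s (120 BPM) at an
# assumed frame rate of 100 frames per second.
frame_rate = 100.0
n = np.arange(500)                       # 5-second section, in frames
novelty = ((n % 50) == 0).astype(float)  # impulse every 50 frames

window = np.hanning(len(n))

def tempogram_entry(tempo_bpm):
    """Magnitude of the correlation between the windowed novelty section
    and a complex sinusoid of frequency tempo/60 Hz (one STFT coefficient)."""
    freq_hz = tempo_bpm / 60.0
    sinusoid = np.exp(-2j * np.pi * freq_hz * n / frame_rate)
    return np.abs(np.sum(window * novelty * sinusoid))

# The entry at the true tempo dominates an unrelated tempo:
print(tempogram_entry(120), tempogram_entry(97))
```

Because the sinusoid at 120 BPM lines up in phase with every impulse, the contributions add constructively; at an unrelated tempo the phases spread around the unit circle and largely cancel.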
Fourier vs Autocorrelation
| Fourier Tempogram | Autocorrelation Tempogram |
|---|---|
| Compare the novelty curve with templates consisting of sinusoidal kernels each representing a specific tempo | Compare the novelty curve with time-shifted copies of itself |
| Reveals periodic sequences of peaks | Reveals periodic self-similarities |
| Emphasizes harmonics, i.e. multiples of the tempo (tatum level) | Emphasizes subharmonics, i.e. fractions of the tempo (measure level) |
1. Local sinusoid
A single tempogram entry is computed by taking a suitably windowed sinusoid of the corresponding frequency and correlating it with a section of the novelty curve.
The entry is defined by the correlation value; in other words, this value is the magnitude of the corresponding Fourier coefficient.
2. Maximizing coefficient of the windowed sinusoid
For each time position (column), we take the maximizing coefficient. Each such coefficient defines a windowed sinusoid, a local periodicity kernel, which explains the local periodic behaviour of the novelty curve.
3. Accumulation
Accumulate all maximizing kernels over time, one kernel per frame. This yields a single curve.
4. Half-wave rectification
Set all negative values to zero.
5. Predominant Local Pulse (PLP)
The original novelty curve is noisy, with peaks that are small and blurred.
The PLP curve is much more refined and does not lose the original information; instead, it enhances certain peaks. It can recover weak onsets present in the novelty curve, and it captures sudden local changes: if the tempo of a piece suddenly increases, PLP can follow it thanks to the accumulation of local periodicity kernels.
The PLP curve can be regarded as a periodicity enhancement of the original novelty curve.
| Novelty curve | Predominant Local Pulse (PLP) curve |
|---|---|
| Indicates note onset candidates | Periodicity enhancement of the novelty curve |
| Extraction errors, in particular for soft onsets | Accumulation introduces robustness to errors |
| Fragile peak-picking step | Locality of the kernels handles tempo variations |
6. Predominant Local Pulse (PLP) - modification
One can improve PLP by introducing prior knowledge: the periodicity kernels can be restricted to a certain tempo range, e.g., only kernels between 100 and 150 BPM are allowed.
One can also increase the duration of the kernels (windows).
import librosa
librosa.show_versions()
INSTALLED VERSIONS
------------------
python: 3.7.13 (default, Apr 24 2022, 01:04:09) [GCC 7.5.0]
librosa: 0.8.1
audioread: 2.1.9
numpy: 1.21.6
scipy: 1.4.1
sklearn: 1.0.2
joblib: 1.1.0
decorator: 4.4.2
soundfile: 0.10.3
resampy: 0.2.2
numba: 0.51.2
numpydoc: None
sphinx: 1.8.6
sphinx_rtd_theme: None
sphinxcontrib.versioning: None
sphinx-gallery: None
pytest: 3.6.4
pytest-mpl: None
pytest-cov: None
matplotlib: 3.2.2
presets: None
import matplotlib.pyplot as plt
from librosa import display
import numpy as np
import IPython
filename = librosa.example('choice')
Downloading file 'admiralbob77_-_Choice_-_Drum-bass.ogg' from 'https://librosa.org/data/audio/admiralbob77_-_Choice_-_Drum-bass.ogg' to '/root/.cache/librosa'.
IPython.display.Audio(filename)
Loading the audio file
frames_wave, sampling_rate = librosa.load(filename)
print("frames_wave:", frames_wave.shape, "sampling_rate:", sampling_rate)
frames_wave: (551823,) sampling_rate: 22050
hop_length = 200 # samples per frame
onset_env = librosa.onset.onset_strength(y=frames_wave, sr=sampling_rate, hop_length=hop_length, n_fft=2048)
frames = range(len(onset_env))
frames_time = librosa.frames_to_time(frames, sr=sampling_rate, hop_length=hop_length)
figsize =(20, 6)
plt.figure(figsize=figsize)
plt.plot(frames_time, onset_env) # plot onset_env
plt.xlim(0, frames_time.max())
plt.ylim(0)
plt.xlabel('Time (sec)')
plt.title('Novelty Function')
Text(0.5, 1.0, 'Novelty Function')
Since the novelty function is computed in frame increments, the hop length of this STFT should be pretty small
stft_novelty_curve = librosa.stft(onset_env, hop_length=1, n_fft=512)
fourier_tempogram = np.absolute(stft_novelty_curve)
plt.figure(figsize=(20, 6))
librosa.display.specshow(fourier_tempogram, sr=sampling_rate, hop_length=hop_length, x_axis='time')
<matplotlib.collections.QuadMesh at 0x7fd836222110>
Consider a segment from the above novelty function:
len(frames_time)
2760
n0 = 100
n1 = 500
plt.figure(figsize=(20, 6))
plt.plot(frames_time[n0:n1], onset_env[n0:n1])
plt.xlim(frames_time[n0], frames_time[n1])
plt.xlabel('Time (sec)')
plt.title('Novelty Function')
Text(0.5, 1.0, 'Novelty Function')
log_novelty_function = np.log1p(onset_env[n0:n1])
autocorr_novelty_function = librosa.autocorrelate(log_novelty_function)
plt.figure(figsize=(20, 6))
plt.plot(frames_time[:n1-n0], autocorr_novelty_function)
plt.xlim(frames_time[0], frames_time[n1-n0])
plt.xlabel('Lag (sec)')
plt.ylim(0)
(0.0, 177.66184640120312)
Time lags where the autocorrelation is high are good candidates for the beat period.
plt.figure(figsize=(20, 6))
# skip lag 0 to avoid a division by zero when converting lag to BPM
plt.plot(60 / frames_time[1:n1-n0], autocorr_novelty_function[1:n1-n0])
plt.xlim(20, 200)
plt.xlabel('Tempo (BPM)')
plt.ylim(0)
(0.0, 162.9275689324168)
We will apply this principle of autocorrelation to estimate the tempo at every segment in the novelty function.
tempogram = librosa.feature.tempogram(onset_envelope=onset_env, sr=sampling_rate, hop_length=hop_length, win_length=400)
plt.figure(figsize=(20, 6))
librosa.display.specshow(tempogram, sr=sampling_rate, hop_length=hop_length, x_axis='time', y_axis='tempo')
<matplotlib.collections.QuadMesh at 0x7fd831e40190>
tempo = librosa.beat.tempo(y=frames_wave, sr=sampling_rate)
print(tempo)
[135.99917763]
onset_env = librosa.onset.onset_strength(y=frames_wave, sr=sampling_rate, aggregate=np.median)
# tempo, frames_beats = librosa.beat.beat_track(y=frames_wave, sr=sampling_rate)
# print("tempo:", tempo, "\nframes_beats:", frames_beats)
tempo, frames_beats = librosa.beat.beat_track(onset_envelope=onset_env, sr=sampling_rate)
print("tempo:", tempo, "\nframes_beats:", frames_beats)
tempo: 135.99917763157896
frames_beats: [   3   21   40   59   78   96  116  135  154  173  192  211  230  249  268  287  306  325  344  363  382  401  420  439  458  476  495  515  534  553  572  591  610  629  647  667  686  705  724  743  762  780  799  819  838  857  876  895  914  933  951  971  990 1008]
frames_beats_time = librosa.frames_to_time(frames_beats, sr=sampling_rate)
frames_beats_time
array([ 0.06965986, 0.48761905, 0.92879819, 1.36997732, 1.81115646,
2.22911565, 2.69351474, 3.13469388, 3.57587302, 4.01705215,
4.45823129, 4.89941043, 5.34058957, 5.78176871, 6.22294785,
6.66412698, 7.10530612, 7.54648526, 7.9876644 , 8.42884354,
8.87002268, 9.31120181, 9.75238095, 10.19356009, 10.63473923,
11.05269841, 11.49387755, 11.95827664, 12.39945578, 12.84063492,
13.28181406, 13.7229932 , 14.16417234, 14.60535147, 15.02331066,
15.48770975, 15.92888889, 16.37006803, 16.81124717, 17.2524263 ,
17.69360544, 18.11156463, 18.55274376, 19.01714286, 19.458322 ,
19.89950113, 20.34068027, 20.78185941, 21.22303855, 21.66421769,
22.08217687, 22.54657596, 22.9877551 , 23.40571429])
hop_length = 512
figsize =(20, 6)
fig, ax = plt.subplots(nrows=2, sharex=True, figsize=figsize, dpi=72)
times = librosa.times_like(onset_env, sr=sampling_rate, hop_length=hop_length)
M = librosa.feature.melspectrogram(y=frames_wave, sr=sampling_rate, hop_length=hop_length)
librosa.display.specshow(librosa.power_to_db(M, ref=np.max),
y_axis='mel', x_axis='time', hop_length=hop_length,
ax=ax[0])
ax[0].label_outer()
ax[0].set(title='Mel spectrogram')
ax[1].plot(times, librosa.util.normalize(onset_env),
label='Onset strength')
ax[1].vlines(times[frames_beats], 0, 1, alpha=0.5, color='r',
linestyle='--', label='beats_frame')
ax[1].legend()
<matplotlib.legend.Legend at 0x7fd831d299d0>
IPython.display.Audio(frames_wave, rate=sampling_rate)
clicks = librosa.clicks(times=frames_beats_time, sr=sampling_rate, length=len(frames_wave))
IPython.display.Audio(frames_wave + clicks, rate=sampling_rate)
References

https://asmp-eurasipjournals.springeropen.com/track/pdf/10.1186/s13636-021-00214-7.pdf
https://librosa.org/doc/main/generated/librosa.onset.onset_backtrack.html
https://librosa.org/doc/main/generated/librosa.beat.beat_track.html
https://www.audiolabs-erlangen.de/resources/MIR/FMP/C6/C6S2_TempogramFourier.html
https://www.audiolabs-erlangen.de/resources/MIR/FMP/C6/C6S1_OnsetDetection.html